Udacity DAN Exploratory Data Analysis Red Wine Quality by Quentin THOMAS
========================================================
This report explores a dataset containing chemical compositions and measurements for approximately 1,600 red wines.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Our dataset consists of 13 variables for 1,599 red wines. Because there seem to be a lot of outliers, I decided to remove the top 1% of values for some variables.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.600 Min. :0.1200 Min. :0.0000
## 1st Qu.: 411.2 1st Qu.: 7.100 1st Qu.:0.3900 1st Qu.:0.0900
## Median : 810.5 Median : 7.900 Median :0.5200 Median :0.2500
## Mean : 806.1 Mean : 8.329 Mean :0.5202 Mean :0.2677
## 3rd Qu.:1199.8 3rd Qu.: 9.200 3rd Qu.:0.6300 3rd Qu.:0.4200
## Max. :1599.0 Max. :15.900 Max. :1.0100 Max. :0.7900
## residual.sugar chlorides free.sulfur.dioxide
## Min. :0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.:1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median :2.200 Median :0.07900 Median :13.00
## Mean :2.426 Mean :0.08285 Mean :15.13
## 3rd Qu.:2.600 3rd Qu.:0.08900 3rd Qu.:21.00
## Max. :8.300 Max. :0.35800 Max. :47.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.860 Min. :0.3300
## 1st Qu.: 21.00 1st Qu.:0.9956 1st Qu.:3.220 1st Qu.:0.5500
## Median : 36.50 Median :0.9967 Median :3.310 Median :0.6200
## Mean : 43.78 Mean :0.9967 Mean :3.316 Mean :0.6452
## 3rd Qu.: 59.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7200
## Max. :143.00 Max. :1.0032 Max. :4.010 Max. :1.1600
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.45 Mean :5.661
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
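The trimming step described before this second summary could be sketched as follows. This is a minimal illustration, not the code used for the report: `trim_top_1pct` is a hypothetical helper, the toy data frame is made up, and the choice of columns to trim is an assumption, since the report does not list them.

```r
# Hypothetical helper: keep only the rows whose value is at or below the
# 99th percentile for every column named in `cols`.
trim_top_1pct <- function(df, cols) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    # Cutoffs are computed on the untrimmed column.
    cutoff <- quantile(df[[col]], probs = 0.99, na.rm = TRUE)
    keep <- keep & !is.na(df[[col]]) & df[[col]] <= cutoff
  }
  df[keep, , drop = FALSE]
}

# Toy data standing in for the wine data frame (values are made up).
set.seed(1)
toy <- data.frame(
  chlorides = c(runif(199, 0.05, 0.10), 0.60),
  sulphates = c(runif(199, 0.40, 0.80), 2.00)
)

trimmed <- trim_top_1pct(toy, c("chlorides", "sulphates"))
summary(trimmed)
```

A row is dropped as soon as one of its values exceeds the cutoff for that column, so trimming several columns removes somewhat more than 1% of the rows overall.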
The X variable is not used, so I remove it from the data frame.
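Dropping a column in base R can be done by assigning `NULL` to it. The small data frame below is only a stand-in for the wine data; its values are the first three rows shown in the `str()` output above.

```r
# Stand-in for the wine data frame: first three rows of three columns.
wine_head <- data.frame(
  X       = 1:3,
  alcohol = c(9.4, 9.8, 9.8),
  quality = c(5L, 5L, 5L)
)

# Remove the unused index column.
wine_head$X <- NULL

names(wine_head)
```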
According to the French Wikipedia page on sulfur dioxide in oenology (https://fr.wikipedia.org/wiki/Dioxyde_de_soufre_en_œnologie), it is possible to infer the combined sulfur dioxide from the total and free sulfur dioxide values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 12.00 20.00 28.65 37.00 128.00
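As a sketch of how such a derived variable can be computed, the combined value is the total sulfur dioxide minus the free sulfur dioxide. The column name `combined.sulfur.dioxide` is an assumption (the report does not show the name it uses), and the four rows are the first values from the `str()` output above.

```r
# First four rows of the two sulfur dioxide columns, as shown by str().
wine_head <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60)
)

# Combined (bound) sulfur dioxide = total - free.
# The column name is hypothetical.
wine_head$combined.sulfur.dioxide <-
  wine_head$total.sulfur.dioxide - wine_head$free.sulfur.dioxide

wine_head$combined.sulfur.dioxide
# 23 42 39 43
```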
Now I can display the spread of these variables.
Quality: score based on sensory data, between 0 and 10. However, our dataset only contains scores from 3 to 8. Given the small number of distinct scores, the histogram shape is close to a normal distribution.
Alcohol: The percent alcohol content of the wine. The distribution is skewed to the right.
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale. The distribution is normal (bell shaped).
density: the density of wine is close to that of water, depending on the percent alcohol and sugar content. The distribution is normal (bell shaped).
volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste. The distribution is skewed to the right.
fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily). The distribution is skewed to the right.
sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. The distribution is skewed to the right.
total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. The distribution is skewed to the right.
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. The distribution is skewed to the right.
Like total and free sulfur dioxide, the combined sulfur dioxide distribution is skewed to the right.
chlorides: the amount of salt in the wine. The distribution is skewed to the right; however, I think this is due to outliers. I want to try a log10 transformation:
log10(chlorides): This looks more like a normal distribution.
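The effect of a log10 transformation on a right-skewed variable can be illustrated numerically. The sketch below uses simulated log-normal values, not the wine data, and `skewness` is a hypothetical helper computing the third standardised moment.

```r
# Hypothetical helper: sample skewness (third standardised moment).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Simulated right-skewed values standing in for chlorides (made up).
set.seed(42)
chlorides_toy <- rlnorm(1000, meanlog = log(0.08), sdlog = 0.4)

skew_raw <- skewness(chlorides_toy)         # clearly positive
skew_log <- skewness(log10(chlorides_toy))  # close to zero

c(raw = skew_raw, log10 = skew_log)
```

A skewness close to zero after the transformation is consistent with a roughly symmetric, bell-shaped histogram; in a plot, the same transformation can be applied on the axis instead of on the data.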
residual sugar: the amount of sugar remaining after fermentation stops; it is rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet. The distribution is skewed to the right.
citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines. The distribution is more or less bimodal.
This data set contains 1,599 red wines with 13 variables on the chemical properties of the wine. It is interesting to note that even though the quality is rated by experts, it is still a subjective variable.
Most of the variables are numeric; the others (X and quality) are integer. Except for pH and density, the variables are mostly skewed to the right.
Which chemical properties influence the quality of red wines?
According to the associated text file, I think that acidity, citric acid and chlorides should have an impact on the wine quality. I am also curious to see the correlation between sugar and alcohol.
Following the French Wikipedia page on sulfur dioxide in oenology (https://fr.wikipedia.org/wiki/Dioxyde_de_soufre_en_œnologie), I created the combined sulfur variable, which is equal to the total sulfur dioxide minus the free sulfur dioxide.
I was not sure about the chlorides distribution, so I used a log10 transformation. As the X variable is not used, I have also removed it from the dataframe.
Only alcohol, sulphates and volatile acidity have a meaningful correlation with quality. As a reminder:
The correlation is good, but given the discrete nature of the quality variable it is better to use boxplots.
We can observe that wine quality improves as alcohol and sulphates increase and volatile acidity decreases.
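The two steps above (correlation with quality, then a per-score comparison) could be sketched as follows. The ten rows are the first values shown in the `str()` output, so the coefficients only illustrate the mechanics and do not reproduce the results of the full dataset.

```r
# First ten rows of three columns, as shown by str() at the top of the report.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation of each chemical variable with quality.
cors <- sapply(
  wine_head[c("alcohol", "volatile.acidity")],
  cor,
  y = wine_head$quality
)

# Numeric counterpart of a boxplot: median alcohol for each quality score.
alcohol_by_quality <- tapply(wine_head$alcohol, wine_head$quality, median)

cors
alcohol_by_quality
```

Treating quality as a grouping factor, as `tapply` does here, is the same idea as drawing one box per score in a boxplot.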
Now I want to look for other correlations involving these three variables.
As correlation with quality has already been studied, I only check the correlation with density.
Alcohol tends to decrease as density increases.
Interestingly, sulphates have an acceptable correlation with volatile acidity, which is itself correlated with quality. Citric acid also qualifies.
Volatile acidity decreases as sulphates increase.
Citric acid increases with sulphates.
Like sulphates, citric acid is correlated with volatile acidity.
Citric acid tends to decrease as volatile acidity increases.
I found only three variables that are correlated with quality:

* Alcohol,
* Sulphates,
* Volatile acidity.

Volatile acidity was described in the associated text file as something which can lead to an unpleasant, vinegar taste, so I am not surprised by its correlation, and the boxplot clearly shows that it is a variable that you should keep low if you want a good wine.
On the other hand, sulphates and alcohol seem to improve the quality of the wine.
I wanted to explore these three variables to discover new correlations and I found a trio with: Citric acid, Sulphates, Volatile acidity.
As expected from the description in the text file, alcohol is correlated with density. However, I am surprised that residual sugar has no impact on it.
The strongest relationship I found was between citric acid and volatile acidity.
The combination of volatile acidity and alcohol seems to be a good classifier. We can clearly observe darker points (indicating higher quality) in the bottom right of the plot. Therefore, high alcohol combined with low volatile acidity is a good attribute for a wine.
Once again, a good classifier, this time using alcohol and density, and even better than the previous one. Density, like volatile acidity, has to be kept low.
As volatile acidity and citric acid are correlated variables, I expected pH to affect the quality of the wine. This plot shows that it is not the case: wines of every quality level can have a pH between 3 and 4.
This plot shows that sulphates have a strong impact on the quality of the wines and are present at higher levels in the better ones.
This plot is a good synthesis and I will keep it for the final plots. It might be better if I grouped the quality ratings. It shows the three variables which directly impact the quality of wines.
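One possible way to group the quality ratings is `cut()`. The three bins and their labels below are assumptions chosen for illustration; the report does not say which grouping would be used.

```r
# The scores present in the dataset range from 3 to 8.
quality <- 3:8

# Hypothetical grouping: 3-4 = low, 5-6 = medium, 7-8 = high.
# With the default right = TRUE the intervals are (2,4], (4,6] and (6,8].
quality_group <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "medium", "high")
)

table(quality_group)
```

The resulting factor can then be mapped to colour or used for faceting, which gives three clearly separated groups instead of six shades.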
The Volatile acidity vs Alcohol and Density vs Alcohol plots show very well that a classification is possible with these variables alone.
The graphic representation with the three best correlated variables for quality speaks for itself: high sulphates and alcohol and low volatile acidity produce the best wines.
I expected that pH could play a role in the taste of wine, but the plots show that wines are more or less well balanced.
As the associated text file explains, volatile acidity has a direct impact on the quality of the wine. As volatile acidity decreases, the quality of the wine increases.
This plot is really interesting as it can almost be used as a classifier. It shows how alcohol (and its correlated variable, density) has a strong effect on wine quality (low density and high alcohol).
By far my favourite visualisation as it shows the combined action of the three correlated variables and can give a simple rule for wine making (low volatile acidity, high alcohol and enough sulphates).
This dataset contains 1,599 observations about red wines and their chemical properties. The variable I tried to explain, and which I assumed to be the most interesting one, was the quality rating.
I faced a first issue just by reading the text file description. The quality variable is a rating given by wine tasters. “Expert wine tasters”, to be more precise, but it is still a subjective opinion. For this reason I expected to find few correlated variables.
That was indeed the case: I found only three variables correlated with quality (alcohol, volatile acidity and sulphates) and one hidden variable that is not directly correlated with quality (citric acid).
Alcohol: I was surprised by this positive correlation, as I always thought that high-alcohol wines were bad ones. I assume that alcohol can make the taste stronger.
Sulphates: Sulphates have a positive correlation with wine quality. Sulphates can be used at different times during the wine-making process for their antioxidant and antiseptic properties (French website http://www.carnetdevins.fr/guide-vin-naturel/soufre-sulfites/).
Volatile acidity: Volatile acidity has a negative correlation with wine quality. This is not a surprise, as the text file explains that too high a level of volatile acidity can lead to an unpleasant, vinegar taste.
Citric acid: Citric acid does not have a direct correlation with quality but is correlated with sulphates (positively) and volatile acidity (negatively). So, indirectly, it affects quality in the end.
Because there are few correlated variables, and because the explained variable is subjective, this shows that wine is complex and may deserve a larger survey with ratings in several categories instead of a single, rounded global rating.